{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Example of topic classification in text documents\n\nThis example shows how to balance text data before training a classifier.\n\nNote that in this example the data are only slightly imbalanced; in other\ndata sets, the imbalance ratio can be much larger.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setting the data set\n\nWe use a part of the 20 newsgroups data set by loading 4 topics. The\nscikit-learn loader provides the data already split into a training and a\ntesting set.\n\nNote that class \\#3 (`talk.religion.misc`, since the loader orders the\ncategories alphabetically) is the minority class and has noticeably fewer\nsamples than the majority class, as the class counts printed below show.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_20newsgroups\n\ncategories = [\n    \"alt.atheism\",\n    \"talk.religion.misc\",\n    \"comp.graphics\",\n    \"sci.space\",\n]\nnewsgroups_train = fetch_20newsgroups(subset=\"train\", categories=categories)\nnewsgroups_test = fetch_20newsgroups(subset=\"test\", categories=categories)\n\nX_train = newsgroups_train.data\nX_test = newsgroups_test.data\n\ny_train = newsgroups_train.target\ny_test = newsgroups_test.target"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from collections import Counter\n\nprint(f\"Training class distributions summary: {Counter(y_train)}\")\nprint(f\"Test class distributions summary: {Counter(y_test)}\")"
      ]
    },
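    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The imbalance can be summarized by a single number: the size of the largest\nclass divided by the size of the smallest one. The cell below is a minimal\nsketch of that computation. The helper `imbalance_ratio` and the list\n`toy_labels` are hypothetical names introduced for illustration; they are not\npart of scikit-learn or imbalanced-learn.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from collections import Counter\n\n\ndef imbalance_ratio(labels):\n    \"\"\"Return the majority class count divided by the minority class count.\"\"\"\n    counts = Counter(labels)\n    return max(counts.values()) / min(counts.values())\n\n\n# Made-up labels for illustration only (not the newsgroups targets).\ntoy_labels = [0] * 6 + [1] * 3 + [2] * 4\nprint(f\"Toy imbalance ratio: {imbalance_ratio(toy_labels):.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Calling `imbalance_ratio(y_train)` should give the corresponding ratio for\nthe training set loaded above. A value of 1 means perfectly balanced classes.\n\n"
      ]
    },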
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The usual scikit-learn pipeline\n\nA typical scikit-learn pipeline combines a TF-IDF vectorizer with a\nmultinomial naive Bayes classifier. A classification report summarizes the\nresults on the testing set.\n\nAs expected, the recall of class \\#3 is low, mainly due to the class\nimbalance.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.naive_bayes import MultinomialNB\nfrom sklearn.pipeline import make_pipeline\n\nmodel = make_pipeline(TfidfVectorizer(), MultinomialNB())\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.metrics import classification_report_imbalanced\n\nprint(classification_report_imbalanced(y_test, y_pred))"
      ]
    },
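    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Recall is the fraction of the samples of a class that the classifier\nactually assigns to that class, which is why it tends to suffer for a\nminority class. The cell below is a small sketch on made-up labels\n(`toy_true` and `toy_pred` are hypothetical and unrelated to the newsgroups\ndata). It uses scikit-learn's `recall_score` with `average=None` to obtain\none value per class.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import recall_score\n\n# Made-up labels: class 0 is the majority, class 1 the minority.\ntoy_true = [0, 0, 0, 0, 0, 0, 1, 1]\n# A hypothetical classifier that favors the majority class and misses one\n# of the two minority samples.\ntoy_pred = [0, 0, 0, 0, 0, 0, 0, 1]\n\n# average=None returns one recall value per class instead of a single average.\ntoy_recall = recall_score(toy_true, toy_pred, average=None)\nprint(f\"Per-class recall on the toy labels: {toy_recall}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Applying the same call to `y_test` and `y_pred` should reproduce the\nper-class recall values that appear in the report above.\n\n"
      ]
    },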
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Balancing the classes before classification\n\nTo improve the prediction of class \\#3, it can help to balance the training\ndata before training the naive Bayes classifier. Therefore, we will use\n`imblearn.under_sampling.RandomUnderSampler` to equalize the number of\nsamples in all the classes before training.\n\nIt is also important to note that we are using the\n`imblearn.pipeline.make_pipeline` function implemented in imbalanced-learn,\nwhich handles the samplers properly.\n\n"
      ]
    },
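    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the resampling step concrete, the cell below is a minimal sketch\nthat applies the sampler on its own to a small synthetic problem. The arrays\n`X_toy` and `y_toy` are made up for illustration, and `random_state=0` is\nset only to make the sketch reproducible. With the default sampling\nstrategy, every class should be reduced to the size of the smallest one.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from collections import Counter\n\nimport numpy as np\nfrom imblearn.under_sampling import RandomUnderSampler\n\n# Made-up toy data: 6, 4, and 2 samples in classes 0, 1, and 2.\ny_toy = np.array([0] * 6 + [1] * 4 + [2] * 2)\nX_toy = np.arange(len(y_toy)).reshape(-1, 1)\n\n# The default strategy under-samples every class except the minority class.\ntoy_sampler = RandomUnderSampler(random_state=0)\nX_toy_res, y_toy_res = toy_sampler.fit_resample(X_toy, y_toy)\n\nprint(f\"Before resampling: {sorted(Counter(y_toy.tolist()).items())}\")\nprint(f\"After resampling: {sorted(Counter(y_toy_res.tolist()).items())}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the pipeline that follows, the sampler is applied only while fitting. At\nprediction time it is skipped, so the test documents are never resampled.\n\n"
      ]
    },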
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.pipeline import make_pipeline as make_pipeline_imb\n\nmodel = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Although the overall results are close to those of the previous model, the\nreport below shows that resampling corrects the poor recall of class \\#3, at\nthe cost of lower scores on the other metrics for the other classes. Overall,\nthe results are slightly better.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(classification_report_imbalanced(y_test, y_pred))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}